Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning
نویسندگان
چکیده
We study the potential benefits of unlabeled data to classification prediction to the learner. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are better than those obtainable without access to unlabeled data, unless one postulates very strong assumptions about the distribution of the labels. In particular, we prove that for basic hypothesis classes over the real line, if the distribution of unlabeled data is ‘smooth’, knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor (e.g., 2). We conjecture that a similar phenomena holds for any hypothesis class and any unlabeled data distribution. We also discuss the utility of semi-supervised learning under the common cluster assumption concerning the distribution of labels, and show that even in the most accommodating cases, where data is generated by two uni-modal label-homogeneous distributions, common SSL paradigms may be misleading and inflict poor prediction performance.
منابع مشابه
Effective Semisupervised Learning on Manifolds
The abundance of unlabeled data makes semi-supervised learning (SSL) an attractive approach for improving the accuracy of learning systems. However, we are still far from a complete theoretical understanding of the benefits of this learning scenario in terms of sample complexity. In particular, for many natural learning settings it can in fact be shown that SSL does not improve sample complexit...
متن کاملMaximum margin semi-supervised learning with irrelevant data
Semi-supervised learning (SSL) is a typical learning paradigms training a model from both labeled and unlabeled data. The traditional SSL models usually assume unlabeled data are relevant to the labeled data, i.e., following the same distributions of the targeted labeled data. In this paper, we address a different, yet formidable scenario in semi-supervised classification, where the unlabeled d...
متن کاملUnlabeled Data Does Provably Help
A fully supervised learner needs access to correctly labeled examples whereas a semi-supervised learner has access to examples part of which are labeled and part of which are not. The hope is that a large collection of unlabeled examples significantly reduces the need for labeled-ones. It is widely believed that this reduction of “label complexity” is marginal unless the hidden target concept a...
متن کاملFinite sample analysis of semi-supervised learning
Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically charact...
متن کاملAn Augmented PAC Model for Semi- Supervised Learning
The standard PAC-learning model has proven to be a useful theoretical framework for thinking about the problem of supervised learning. However, it does not tend to capture the assumptions underlying many semi-supervised learning methods. In this chapter we describe an augmented version of the PAC model designed with semi-supervised learning in mind, that can be used to help think about the prob...
متن کامل